LINEA
LINEA is an R library aimed at simplifying and accelerating the development of linear models to understand the relationship between two or more variables.
Linear models are commonly used in a variety of contexts including natural and social sciences, and various business applications (e.g. Marketing, Finance).
This page covers a basic implementation of the linea library to analyse a time series.
We will run a simple model on some fictitious data sourced from Google Trends. The aim of this exercise is to understand which variables appear to have an impact on the ecommerce variable.
The library can be installed from github as follows:
# cran version
# install.packages('linea')
# development version
# devtools::install_github('paladinic/linea')
Once installed, you can check the installed version.
print(packageVersion("linea"))
## [1] '0.0.1'
The linea library works well with pipes. Used alongside dplyr and plotly, it performs data analysis and visualisation with elegant code.
library(linea) # the library in question
library(dplyr) # for pipes (%>%) and data manipulation
library(plotly) # for interactive charts
library(DT) # for interactive tables
The function linea::read_xcsv() can be used to read csv or excel files.
# data_path = 'c:/Users/44751/Desktop/github/data/ecomm_data.csv'
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'
data = read_xcsv(file = data_path)
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
As shown above, the data contains several variables, including the ecommerce variable, other numeric variables, and a date-type variable (i.e. date). With this data we can start building models to understand which variables have an impact on ecommerce. The linea::run_model() function can be used to run an OLS regression model. Some of the function's arguments are:
- data: the data frame containing the model's variables
- dv: the name of the dependent variable
- ivs: the names of the independent variables
- id_var: the name of the id column (e.g. a date column)
The function returns an "lm" object, like the one produced by the stats::lm() function, which can be inspected with the base::summary() function.
model = run_model(data = data,
dv = 'ecommerce',
ivs = c('covid','christmas'),
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -34222 -5723 -106 4361 64271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56339.61 690.61 81.58 <2e-16 ***
## covid 336.41 19.61 17.16 <2e-16 ***
## christmas 383.15 30.65 12.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8947 on 258 degrees of freedom
## Multiple R-squared: 0.6339, Adjusted R-squared: 0.6311
## F-statistic: 223.4 on 2 and 258 DF, p-value: < 2.2e-16
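Since run_model() returns a standard "lm" object (as noted above), all of base R's lm tooling applies to it. A minimal, self-contained illustration with synthetic data (the toy names below are hypothetical, not from the page's dataset):

```r
# Synthetic data: y depends on x1 but not on x2
set.seed(1)
toy = data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y = 2 + 3 * toy$x1 + rnorm(50, sd = 0.1)

# lm() returns the same class of object as linea::run_model()
fit = lm(y ~ x1 + x2, data = toy)
class(fit)             # "lm"
round(coef(fit), 2)    # intercept near 2, x1 near 3, x2 near 0
summary(fit)$r.squared
```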
Models can be inspected visually using the linea::decomping() function, which breaks the model's fitted values down into each variable's contribution.
decomposition = model %>% decomping()
print(names(decomposition))
## [1] "category_decomp" "variable_decomp" "fitted_values"
The decomposition object is a list of 3 data frames, which can be visualised with the functions linea::fit_chart() and linea::decomp_chart().
The first two, variable_decomp and category_decomp, capture the role of individual variables in the model (categories can be set to group variables).
decomposition$variable_decomp %>%
datatable(rownames = NULL,
options = list(scrollX = T))
The linea::decomp_chart() function can be used to display a stacked bar chart of the decomposition.
model %>%
decomp_chart()
The fitted_values data frame contains the dependent variable (actual), the model's prediction (model), and the error (residual).
decomposition$fitted_values %>%
datatable(rownames = NULL,
options = list(scrollX = T))
The linea::fit_chart() function can be used to display a line chart of the Prediction, Actual, and Error.
model %>%
fit_chart()
The linea::acf_chart() and linea::resid_hist_chart() functions can be used to assess your model against the assumptions of linear regression.
Using the linea::acf_chart() function we can visualize the autocorrelation function (ACF), which helps us detect autocorrelation in the residuals.
model %>%
acf_chart()
Using the linea::resid_hist_chart() function we can visualize the distribution of the residuals, which helps us assess residual normality.
model %>%
resid_hist_chart()
Using the linea::response_curves() function we can visualize the relationship between the independent variables and the dependent variable.
model %>%
response_curves()
Using a date column, of data-type date, we can generate seasonality variables with linea::get_seasonality(). Several columns will be added to the original dataframe: dummy variables that capture some basic holidays, as well as year, month, and week number. A trend variable is also added: a column running from 1 to n, where n is the number of rows.
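To illustrate the kind of columns this produces, here is a plain-R sketch (hypothetical helper code, not linea's implementation) building a trend column and month dummies from a weekly date column:

```r
# Hypothetical sketch: build a trend column and month dummies by hand
df = data.frame(date = seq(as.Date("2019-01-06"), by = "7 days", length.out = 10))
df$trend = seq_len(nrow(df))                  # 1..n, one step per row
df$month = format(df$date, "%b")              # month label per date
month_dummies = model.matrix(~ month - 1, df) # one 0/1 column per month present
head(cbind(df, month_dummies))
```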
data = data %>%
get_seasonality(
date_col_name = 'date',
date_type = 'weekly ending')
data %>%
datatable(rownames = NULL,
options = list(scrollX = TRUE))
plot_ly(data) %>%
add_bars(y = ~ week_26,
x = ~ date,
name = 'week_26',
color = color_palette()[1]) %>%
add_bars(y = ~ new_years_eve,
x = ~ date,
name = 'new_years_eve',
color = color_palette()[2]) %>%
add_bars(y = ~ year_2019,
x = ~ date,
name = 'year_2019',
color = color_palette()[3]) %>%
layout(yaxis = list(title = 'value'),
title = 'Seasonality Variables',
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
These variables can be used in the model to capture the seasonal component of the dependent variable, among other things (e.g. trend).
model = run_model(data = data,
dv = 'ecommerce',
ivs = c('covid', 'christmas', 'trend','month_Dec'),
id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -33092 -3553 -1086 3033 60996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47639.504 1072.631 44.414 < 2e-16 ***
## covid 180.013 23.508 7.657 3.89e-13 ***
## christmas 446.518 39.255 11.375 < 2e-16 ***
## trend 84.042 8.825 9.524 < 2e-16 ***
## month_Dec -7638.941 2650.433 -2.882 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7626 on 256 degrees of freedom
## Multiple R-squared: 0.7361, Adjusted R-squared: 0.732
## F-statistic: 178.5 on 4 and 256 DF, p-value: < 2.2e-16
Thanks to the new variables, this model has a higher R-squared than the previous one. The impact of these variables can be seen clearly using the linea::decomping() and linea::decomp_chart() functions.
model %>%
decomp_chart()
linea provides a few default transformations: mathematical functions meant to capture non-linear relationships in the data. These functions are used internally by the library but are also exported for direct use.
The linea::decay() function applies a decay by adding to each data point a percentage of the previous one. This transformation is meant to capture the impact of an event over time, and only makes sense for time-based models.
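The recursion described above (often called adstock) can be sketched in plain R, with each transformed point being the raw value plus a fraction of the previous transformed value. This is an illustrative stand-in, not linea's implementation:

```r
# Illustrative adstock-style decay: carry over a fraction of the past
decay_sketch = function(x, rate = 0.5) {
  out = numeric(length(x))
  out[1] = x[1]
  for (i in seq_along(x)[-1]) {
    out[i] = x[i] + rate * out[i - 1]
  }
  out
}
decay_sketch(c(100, 0, 0, 0), rate = 0.5)  # 100, 50, 25, 12.5
```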
raw_variable = data$online_media
dates = data$date
plot_ly() %>%
add_lines(y = raw_variable, x = dates, name = 'raw') %>%
add_lines(y = decay(raw_variable, decay = 0.5),
x = dates,
name = 'transformed: decay 50%') %>%
add_lines(y = decay(raw_variable, decay = 0.75),
x = dates,
name = 'transformed: decay 75%') %>%
add_lines(y = decay(raw_variable, decay = 0.95),
x = dates,
name = 'transformed: decay 95%') %>%
layout(title = 'decay',
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
The linea::diminish() function applies a negative exponential function:
\[ 1 - e^{-v/m} \]
or, equivalently,
\[ 1 - \frac{1}{e^{v/m}} \]
where v is the vector to be transformed and m defines the shape of the transformation. Here is a visualization of the transformation.
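As a quick numeric check, the negative-exponential form above can be computed directly (a sketch; linea::diminish() may scale or parameterise its inputs differently):

```r
# 1 - exp(-v/m): rises steeply at first, then flattens toward 1
diminish_sketch = function(v, m) 1 - exp(-v / m)
diminish_sketch(c(0, 1, 5), m = 1)  # 0, ~0.632, ~0.993
```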
raw_variable = data$offline_media
dates = data$date
plot_ly() %>%
add_lines(y = raw_variable, x = dates, name = 'raw') %>%
add_lines(
y = diminish(raw_variable, m = 0.3, abs = F),
x = dates,
name = 'transformed: diminish 30%',
yaxis = "y2"
) %>%
layout(title = 'diminish',
yaxis2 = list(overlaying = "y",
showgrid = F,
side = "right"),
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
This transformation can also be visualized by placing the raw variable on the horizontal axis and the transformed variable on the vertical axis.
plot_ly() %>%
add_lines(
x = raw_variable,
y = diminish(raw_variable,.25,F),
name = 'diminish 25%',
line = list(shape = "spline")
) %>%
add_lines(
x = raw_variable,
y = diminish(raw_variable,.5,F),
name = 'diminish 50%',
line = list(shape = "spline")
) %>%
add_lines(
x = raw_variable,
y = diminish(raw_variable,.75,F),
name = 'diminish 75%',
line = list(shape = "spline")
) %>%
layout(title = 'raw vs. diminished',
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
plot_ly() %>%
add_trace(
x = raw_variable,
y = diminish(raw_variable,.5,F)
) %>%
layout(title = 'raw vs. diminished (m = 50%)',
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
The linea::hill_function() function applies an S-shaped (Hill) curve. This transformation is meant to capture relationships that exhibit diminishing returns.
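For reference, a common parameterisation of the Hill curve is shown below. This exact form is an assumption; consult linea's documentation for the parameterisation hill_function() actually uses:

```r
# Standard Hill curve: S-shaped, saturating at 1, half-saturation at x = k
hill_sketch = function(x, k, n) x^n / (k^n + x^n)
hill_sketch(c(0, 1, 2, 10), k = 2, n = 2)  # 0, 0.2, 0.5, ~0.96
```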
raw_variable = data$ecommerce
dates = data$date
plot_ly() %>%
add_lines(y = raw_variable, x = dates, name = 'raw') %>%
add_lines(
y = diminish(raw_variable, m = 0.3, abs = F),
x = dates,
name = 'transformed: diminish 30%',
yaxis = "y2"
) %>%
layout(title = 'diminish',
yaxis2 = list(overlaying = "y",
showgrid = F,
side = "right"),
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
This transformation can also be visualized by placing the raw variable on the horizontal axis and the transformed variable on the vertical axis.
plot_ly() %>%
add_trace(
x = raw_variable,
y = diminish(raw_variable,.3,F)
) %>%
layout(title = 'raw vs. diminished',
xaxis = list(showgrid = F),
plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)")
The linea::lag() function applies a lag to the data. This transformation is meant to capture relationships that are lagged in time. This function only makes sense on time-bound models.
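A lag simply shifts the series forward in time, padding the start with NA. A plain-R sketch of the idea (linea::lag() may handle the padding differently):

```r
# Shift x forward by n positions, padding the start with NA
lag_sketch = function(x, n = 1) c(rep(NA, n), head(x, length(x) - n))
lag_sketch(c(5, 10, 15, 20), n = 2)  # NA, NA, 5, 10
```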
raw_variable = data$ecommerce
dates = data$date
plot_ly() %>%
add_lines(y = raw_variable, x = dates, name = 'raw') %>%
add_lines(
y = lag(raw_variable, n = 1),
x = dates,
name = 'transformed: lag 1',
) %>%
add_lines(
y = lag(raw_variable, n = 5),
x = dates,
name = 'transformed: lag 5',
) %>%
add_lines(
y = lag(raw_variable, n = 10),
x = dates,
name = 'transformed: lag 10',
) %>%
layout(plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)",
title = 'lag',
xaxis = list(showgrid = F))
The linea::ma() function applies a moving average to the data. This transformation is meant to capture relationships that are smoothed over time. This function only makes sense on time-bound models.
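A centred moving average can be sketched with stats::filter() (an illustration; linea::ma()'s width and align arguments may treat the series edges differently):

```r
# Centred moving average; ends are NA where the window doesn't fit
ma_sketch = function(x, width) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 2))
}
ma_sketch(c(1, 2, 3, 4, 5), width = 3)  # NA, 2, 3, 4, NA
```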
raw_variable = data$black.friday
dates = data$date
plot_ly() %>%
add_lines(y = raw_variable, x = dates, name = 'raw') %>%
add_lines(
y = ma(raw_variable, width = 5),
x = dates,
name = 'transformed: ma 5',
) %>%
add_lines(
y = ma(raw_variable, width = 15),
x = dates,
name = 'transformed: ma 15',
) %>%
add_lines(
y = ma(raw_variable, width = 25),
x = dates,
name = 'transformed: ma 25',
) %>%
add_lines(
y = ma(raw_variable, width = 25,align = 'left'),
x = dates,
name = 'transformed: ma 25 left',
) %>%
layout(plot_bgcolor = "rgba(0, 0, 0, 0)",
paper_bgcolor = "rgba(0, 0, 0, 0)",
xaxis = list(showgrid = F),
title='ma')
linea can capture non-linear relationships by applying transformations to the raw data and then running the regression on the transformed data. This is accomplished using a model table, which specifies each variable's transformation parameters. The function linea::build_model_table() can be used to generate a blank model table.
model_table = build_model_table(ivs = c('covid','christmas','trend'))
model_table %>%
datatable(rownames = NULL,
options = list(scrollX = T,
dom = "t"))
The model table can be written to a CSV or Excel file and modified outside of R, or edited with dplyr as shown below. In this example, the model run will apply the linea::diminish() function (with a parameter of 10) and the linea::decay() function (with a parameter of 0.5) to the "covid" variable.
model_table = model_table %>%
mutate(diminish = if_else(variable == 'covid','10',diminish)) %>%
mutate(decay = if_else(variable == 'covid','.5',decay))
model_table %>%
datatable(rownames = NULL,
options = list(scrollX = T,
dom = "t"))
The model table can be used as an input to the linea::run_model() function. The linea::response_curves() function will display the non-linear relationship captured by the model.
model = run_model(data = data,
dv = 'ecommerce',
model_table = model_table)
model %>%
response_curves(
x_min = 0,
x_max = 30,
y_min = 0,
y_max = 20000,
interval = 0.01
)
The function linea::what_next() can be used to run a model for each of the unused variables in the data. It re-runs the current model specification with each unused variable added in turn, and returns a data.frame containing the statistics of each model run.
model %>%
what_next() %>%
datatable(rownames = NULL)
## Warning: model object does not contain 'meta_data'.
## Warning: model object does not contain 'id_var'.
Similarly, to find the right parameters for a non-linear relationship, the function linea::what_trans() can be used to run multiple models across a range of parameters. If parameters are passed for multiple transformations, the function will run models for all combinations. Its main inputs are the model, a trans_df data frame specifying the transformations and parameter values to try, and the variable to which they should be applied.
The trans_df can be built as follows:
trans_df = data.frame(
name = c('diminish', 'decay', 'lag', 'ma'),
func = c(
'linea::diminish(x,a)',
'linea::decay(x,a)',
'linea::lag(x,a)',
'linea::ma(x,a)'
),
order = 1:4
) %>%
dplyr::mutate(val = '') %>%
dplyr::mutate(val = dplyr::if_else(condition = name == 'diminish',
'0.5,1,10,100,1000',
val))
trans_df %>%
datatable(rownames = NULL)
model %>%
what_trans(trans_df = trans_df,
variable ='offline_media') %>%
datatable(rownames = NULL)
Google Trends can be a very useful source of data, as Google search volumes are often correlated with real-world events and can be used as a proxy for a missing variable. The function linea::gt_f() will return the original dataframe with the added Google Trends variable.
# data = data %>%
# gt_f(kw = 'ramadan',append = T) %>%
# gt_f(kw = 'trump',append = T) %>%
# gt_f(kw = 'covid',append = T)
#
# data %>%
# datatable(options = list(scrollX = T),rownames = NULL)
The output of the linea::decomp_chart() function can be grouped based on a data.frame mapping variables to categories. This helps simplify the visualisation and focus on specific groups of variables.
categories = data.frame(
variable = names(model$coefficients),
category = c('Base','covid','seasonality','Base'),
calc = rep('none',4)
)
model = run_model(
data = data,dv = 'ecommerce',
model_table = model_table,
categories = categories,
id_var = 'date'
)
model %>%
decomp_chart(variable_decomp = F)
Coming soon